
perf(vllm): fuse MiniMax M3 BF16 EP experts on MI300X #1782

Open
Oseltamivir wants to merge 4 commits into feat/m3-mi300x-mxfp8 from codex/minimax-m3-mi300x-ep-mxfp8


@Oseltamivir Oseltamivir commented Jun 15, 2026

Collaborator

Summary

  • add a sparse local-route BF16 MoE path for the profiled MiniMax M3 TP8+EP8
    long-context shape on MI300X
  • discard remote routes during expert alignment and size buffers for 16 local
    experts instead of 128 global experts
  • use 16-row grouped-GEMM tiles that match the measured ~6.75 routes per local
    expert instead of the existing 64-row tile
  • fuse split SwiGLU-OAI into the BF16 GEMM1 epilogue, eliminating the standalone
    activation kernel and the 2x-intermediate GEMM1 output
  • retain the measured native/BF16 policy for short-context EP8

This PR is stacked on #1753 and contains only the incremental EP8 optimization.
It does not include the profiling branch, AITER allreduce/RMSNorm work,
temporary benchmark configuration, or perf-changelog.yaml changes.

Profile basis

The six-point MI300X profile found expert GEMM1+GEMM2 at 30.31 ms for 1k/c256
and 28.10 ms for 8k/c256. After collective fusion, expert GEMMs remained the
largest classified 8k/c256 phase at 28.79 ms across 114 calls.

At c256, MiniMax M3 has about 216 active tokens and top-k 4, or 864 routed rows
globally. EP8 owns 16 of 128 experts per rank, leaving about 108 local rows,
roughly 6.75 rows per local expert. The existing BF16 config uses a 64-row M
tile, so when every local expert receives at least one route it executes about
1,024 padded rows (16 experts x 64 rows) per rank for roughly 108 useful rows.
Global alignment also creates blocks for remote experts that do no useful
GEMM work.
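
The arithmetic above can be checked with a short sketch. It assumes perfectly uniform routing across ranks and experts, which real traffic will only approximate; the constant and function names are illustrative and not taken from the patch.

```python
import math

ACTIVE_TOKENS = 216    # approximate active tokens at c256, from the profile
TOP_K = 4
GLOBAL_EXPERTS = 128
EP_SIZE = 8

routed_rows = ACTIVE_TOKENS * TOP_K             # rows routed across all ranks
local_experts = GLOBAL_EXPERTS // EP_SIZE       # experts owned by one rank
local_rows = routed_rows / EP_SIZE              # assumes uniform routing
rows_per_expert = local_rows / local_experts


def padded_rows(block_m):
    """Rows executed per rank when each local expert pads to whole M tiles."""
    tiles_per_expert = math.ceil(rows_per_expert / block_m)
    return local_experts * tiles_per_expert * block_m


print(routed_rows, local_rows, rows_per_expert)
print(padded_rows(64), padded_rows(16))
```

Under these assumptions the 64-row tile executes 1,024 rows per rank and the 16-row tile executes 256, which is where the "up to 4x" figure comes from.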

Profile report:
https://github.com/SemiAnalysisAI/InferenceX/blob/profiling/experimental/minimax_m3_mi300x_profile.md

First-principles changes

  1. Alignment remaps and retains only locally owned routes. Its allocation bound
    is based on 16 local experts, while the device counter remains authoritative.
  2. GEMM1 and GEMM2 use BLOCK_SIZE_M=16, matching the observed route density
    and reducing padded expert-row computation by up to 4x versus the 64-row
    tile.
  3. GEMM1 loads each activation tile once, computes gate and up projections, and
    applies split SwiGLU-OAI before storing. This halves its BF16 output traffic
    and removes a separate activation launch.
  4. GEMM2 applies router weights in FP32 as before.
  5. The existing expert-map-aware fused reduction sums only local weighted rows.
    It avoids direct atomic accumulation, which the profile identified as a poor
    fit for the c256 top-k-4 shape.
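
As a rough illustration of item 1, the sketch below groups routes by local expert, drops remote routes, and pads each group to the tile size. It is a plain-Python model and not the patched kernel: `align_local_routes`, the dict-based `expert_map`, and the `-1` padding sentinel are all hypothetical choices made for this example.

```python
def align_local_routes(topk_ids, expert_map, block_m):
    """Group routes by local expert, discarding remote ones, and pad to block_m.

    topk_ids:   flat list of global expert ids, one entry per route.
    expert_map: dict of global expert id -> local expert id for this rank.
    Returns (sorted_route_ids, block_expert_ids); padding slots hold -1.
    """
    buckets = [[] for _ in range(len(expert_map))]
    for route, global_id in enumerate(topk_ids):
        local_id = expert_map.get(global_id)
        if local_id is not None:          # remote routes are dropped here
            buckets[local_id].append(route)

    sorted_route_ids, block_expert_ids = [], []
    for local_id, routes in enumerate(buckets):
        pad = (-len(routes)) % block_m    # slots needed to fill the last tile
        padded = routes + [-1] * pad
        sorted_route_ids.extend(padded)
        block_expert_ids.extend([local_id] * (len(padded) // block_m))
    return sorted_route_ids, block_expert_ids


ids, blocks = align_local_routes([0, 5, 1, 9, 0, 1], {0: 0, 1: 1}, block_m=4)
print(ids, blocks)
```

Because only local experts get buckets, the output length is bounded by the local expert count rather than the global one, which is the point of the tighter allocation.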

The path is gated to the exact gfx94x MiniMax M3 EP8 BF16 shape. gfx95x and
other models/configurations are unchanged.

Validation

Static and local validation:

  • python -m pytest utils/matrix_logic/ -q: 156 passed
  • bash -n benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
  • runtime patch dry-runs and applies cleanly to the pinned image source
  • patched vLLM source passes Ruff, formatting, compileall, and
    git diff --check
  • upstream branch includes local-route GEMM, alignment-allocation, and
    expert-map reduction correctness tests

MI300X serving validation is pending infrastructure recovery. The exact six-job
matrix (c1/c16/c256 for 1k1k and 8k1k) was dispatched four times, but every
attempt failed before GPU allocation because the Slurm controller was
unreachable:

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27569397626


Note

Medium Risk
Touches inference hot-path MoE kernels and backend selection with fused numerics and atomic top-k reduction, though scope is tightly gated to a specific gfx94x MiniMax-M3 EP8 shape.

Overview
Adds a second runtime patch (minimaxm3_mi300x_ep_mxfp8.patch) to the MI300X MiniMax-M3 benchmark recipe and wires it in after the base MXFP8 patch, with oracle-marker checks so EP8 optimizations are applied idempotently.

EP8 backend split by context length: For profiled gfx94x MiniMax-M3 EP8, short context (max_model_len ≤ 4096) now selects the native MXFP8 backend with the existing mixed native/BF16 expert policy extended to EP8; long context keeps emulation but switches to a new sparse local-route BF16 path instead of treating all EP as slow emulation-only.

Long-context sparse BF16 path (emulation): When the exact EP8 shape matches, MoE runs only locally owned routes—expert alignment can use num_local_experts so buffers and padding are sized for ~16 local experts rather than 128 global—and uses 16-row grouped GEMM tiles. GEMM1 is fused with split SwiGLU-OAI via a new fused_moe_gated_kernel; GEMM2 still applies router weights in the expert GEMM, and the fused top-k reduction can skip re-multiplying weights (apply_weights=False).
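
The GEMM1 fusion described above can be modeled in plain Python to show that the fused and unfused orders produce the same values. This is a sketch under assumptions: the activation is the clamped SwiGLU form that "SwiGLU-OAI" is understood to refer to, the `alpha` and `limit` defaults are illustrative, and the helper names are hypothetical. It also ignores the BF16 rounding the real kernel applies.

```python
import math


def swiglu_oai(gate, up, alpha=1.702, limit=7.0):
    # Assumed clamped SwiGLU form; alpha and limit are illustrative defaults.
    gate = min(gate, limit)
    up = max(-limit, min(up, limit))
    return gate / (1.0 + math.exp(-alpha * gate)) * (up + 1.0)


def dot(row, x):
    return sum(w * v for w, v in zip(row, x))


def unfused(x, w13, n):
    # GEMM1 stores 2*n values, then a separate pass applies the activation.
    h = [dot(row, x) for row in w13]      # gate in h[:n], up in h[n:]
    return [swiglu_oai(h[j], h[n + j]) for j in range(n)]


def fused(x, w13, n):
    # One pass: compute gate and up for column j, activate, store n values.
    out = []
    for j in range(n):
        gate = dot(w13[j], x)
        up = dot(w13[n + j], x)           # split layout: up row sits n rows later
        out.append(swiglu_oai(gate, up))
    return out


x = [0.5, -1.0, 2.0]
w13 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
print(fused(x, w13, 2) == unfused(x, w13, 2))
```

The fused variant never materializes the 2n-wide intermediate, which is the traffic saving the summary refers to.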

Native MXFP8 EP improvements: Grouped GEMM launch bounds use a local-expert-aware _max_post_padded; GEMM2 can fuse top-k reduction with relaxed atomics; SwiGLU+MXFP8 quant gains a route-aware variant that only processes aligned local rows after GEMM1.
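
The two top-k reduction styles mentioned above can be compared in a small model: a per-token gather that masks remote routes, and a route-major accumulation into a zero-initialized output in the style of an atomic add. Rows are scalars and already weighted to keep the example short; the function names and the set-based local-expert check are hypothetical.

```python
def reduce_gather(rows, topk_ids, local_experts, top_k):
    """Per-token sum over top_k route rows; remote rows are masked out."""
    num_tokens = len(topk_ids) // top_k
    out = [0.0] * num_tokens
    for token in range(num_tokens):
        for k in range(top_k):
            route = token * top_k + k
            if topk_ids[route] in local_experts:   # unwritten rows never read
                out[token] += rows[route]
    return out


def reduce_scatter(rows, topk_ids, local_experts, top_k):
    """Route-major accumulation into a zeroed output, as an atomic add would."""
    out = [0.0] * (len(topk_ids) // top_k)
    for route, global_id in enumerate(topk_ids):
        if global_id in local_experts:
            out[route // top_k] += rows[route]     # output row = route // top_k
    return out


nan = float("nan")                                  # stands in for unwritten rows
rows = [1.0, nan, 2.0, 3.0, nan, nan, 0.5, nan]
topk_ids = [0, 9, 1, 0, 9, 8, 1, 7]
print(reduce_gather(rows, topk_ids, {0, 1}, 4))
print(reduce_scatter(rows, topk_ids, {0, 1}, 4))
```

Both forms give the same sums here. On a GPU the scatter form contends on output rows when several routes of one token land together, which is one reason a gather-style reduction may suit some shapes better.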

Changes are gated to the profiled MiniMax-M3 EP8 configuration on gfx94x; other models and platforms are intended to be unchanged.

Reviewed by Cursor Bugbot for commit 16c596a.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir marked this pull request as ready for review June 16, 2026 17:53

claude Bot commented Jun 16, 2026

Contributor

Claude finished @Oseltamivir's task in 7m 0s


Review of PR #1782

LGTM — no blocking issues found.

Summary: This is a well-structured incremental optimization for MiniMax M3 EP8 on MI300X. The patch adds two complementary paths (short-context native/BF16 dispatch via the oracle, long-context sparse BF16 emulation with fused SwiGLU), and the gating logic correctly ensures each path is only reached for the profiled configuration. Key observations:

  • Correctness of the fused gated kernel: fused_moe_gated_kernel correctly computes both gate and up projections in a single pass, preserves the BF16 store/reload boundary before applying SwiGLU in FP32 (matching the unfused numerical behavior), and writes per-route rows to the activation buffer. The b_up_ptrs = b_gate_ptrs + N * stride_bn offset correctly addresses the split layout.
  • Local-expert allocation tightening: The num_local_experts parameter in moe_align_block_size correctly reduces the allocation from 128 * (block_size-1) to 16 * (block_size-1) padding tokens. The _max_post_padded bound is sound — min(allocation, M*block_m, M + E*(block_m-1)) takes the tightest of three independent bounds.
  • EP route filtering in emulation path: The _apply_sparse_mi300x_ep method correctly passes ignore_invalid_experts=True and num_local_experts=E to _prepare_expert_assignment, so remote-expert routes are dropped from sorted_token_ids. The downstream moe_fused_mul_sum with apply_weights=False and expert_map correctly zeros contributions from non-local experts (via tl.where(expert_mask, b_val, 0.0)), so unwritten rows in the workspace don't leak garbage.
  • Fused top-k atomic path in native MXFP8: FUSE_TOPK=True uses tl.atomic_add on a zero-initialized output with c_row = offs_token // top_k, correctly accumulating weighted per-route results into per-token output rows. The M_routed % top_k divisibility check guards against misuse.
  • Oracle routing: Short-context EP8 (max_model_len <= 4096) routes to Mxfp8NativeTritonExperts with mixed native/BF16 weights; long-context EP8 routes to Mxfp8EmulationTritonExperts with the sparse BF16 path. Both guards (_should_use_native_ep, _is_minimax_m3_mi300x_ep8) are scoped to the exact MiniMax M3 config.
  • Benchmark script: Patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. Server launch uses proper line continuations.
  • Repo guidelines: No master config or perf-changelog changes needed (this PR only touches the benchmark script and runtime patch). No MTP or model prefix concerns.
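
The three-way bound quoted in the allocation bullet above can be sanity-checked with a brute-force sketch. The function below mirrors the `min(allocation, M*block_m, M + E*(block_m-1))` expression from the review; its name, argument names, and the exhaustive checker are illustrative and not the patch's code.

```python
from itertools import product


def max_post_padded(allocation, m_routes, num_local_experts, block_m):
    """Tightest of three upper bounds on the padded row count."""
    return min(
        allocation,                                     # buffer actually allocated
        m_routes * block_m,                             # every route alone in a tile
        m_routes + num_local_experts * (block_m - 1),   # < block_m padding per expert
    )


def padded_total(counts, block_m):
    """Rows after rounding each expert's route count up to whole tiles."""
    return sum(-(-c // block_m) * block_m for c in counts)


def bound_holds(num_experts, max_count, block_m):
    """Exhaustively check the bound for every small route distribution."""
    for counts in product(range(max_count + 1), repeat=num_experts):
        m_routes = sum(counts)
        bound = max_post_padded(float("inf"), m_routes, num_experts, block_m)
        if padded_total(counts, block_m) > bound:
            return False
    return True


print(max_post_padded(10**9, 108, 16, 16), max_post_padded(10**9, 108, 128, 16))
print(bound_holds(3, 5, 4))
```

With about 108 local routes and 16-row tiles, counting 16 local experts instead of 128 global ones lowers the bound from 1,728 rows to 348, assuming the allocation itself is not the limiting term.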
